
macos / ARM support for vllm #2244

Open · wants to merge 6 commits into base: main
Conversation

@pathorn pathorn commented Dec 22, 2023

Built on top of a rebased version of:

Build instructions:

Make sure to install OpenMP from https://mac.r-project.org/openmp/

Not using Docker. Tested in a miniconda3 env.

To build:

Just the .so:

VLLM_TARGET_DEVICE=cpu python3 setup.py build_ext

Build the full thing:

VLLM_TARGET_DEVICE=cpu python3 setup.py build
VLLM_TARGET_DEVICE=cpu python3 setup.py install
# The first run may fail with:
# error: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

# The error does not reproduce on a second run, and the result does run. Run install again:
VLLM_TARGET_DEVICE=cpu python3 setup.py install

(NOTE: The .so file will not load on macOS if your current working directory is the repository root. Change directory before running.)

Run the API server with:

# Make sure you are not in the vllm repository root.
python3 -m vllm.entrypoints.openai.api_server --device cpu --enforce-eager --port 8000

Testing

Tested a few prompts on Mistral-7B and verified against the output returned by DeepInfra.

curl http://localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{"model":"mistralai/Mistral-7B-Instruct-v0.1","temperature":0.0,"prompt":"Why did the chicken cross the road?\n","stop":"side","max_tokens":50}'; echo

Mistral-7B gets about 0.6 tokens per second on my plain MacBook M3. It's not going to be as fast as a dedicated GPU, and this PR does not support GPTQ (e.g. 4-bit quantization), so llama.cpp knocks it out of the park in terms of performance, getting around 20 TPS on the same MacBook hardware.
[Screenshot 2023-12-22 at 5:48:36 AM]

I also hope this PR serves as documentation for some of the bfloat16 ARM instructions added in recent years: for example, vcvtq_high_bf16_f32(vcvtq_low_bf16_f32(a), b) to convert from fp32 to bf16, or the implementation of fused multiply-add, which took a few hours of research because the vbfmlaltq and vbfmlalbq bfloat16 instructions (the ARM equivalents of _mm512_dpbf16_ps ("DPBF16PS") on AVX-512) are almost completely ungooglable. I was a bit intrigued to be one of the first people to use these ARMv8.6-A+bf16 instructions, or at least to write about them:

// vbfmlalbq_f32 multiply-accumulates the even-numbered (bottom) bf16 lanes of a and b
// into the fp32 accumulator c; vbfmlaltq_f32 does the same for the odd-numbered (top)
// lanes, so each pair of calls covers all eight bf16 products in one 128-bit register.
inline FP32Vec16 fma(BF16Vec32 &a, BF16Vec32 &b, FP32Vec16 &c) {
  return FP32Vec16(float32x4x4_t({
    vbfmlaltq_f32(vbfmlalbq_f32(c.reg.val[0], a.reg.val[0], b.reg.val[0]), a.reg.val[0], b.reg.val[0]),
    vbfmlaltq_f32(vbfmlalbq_f32(c.reg.val[1], a.reg.val[1], b.reg.val[1]), a.reg.val[1], b.reg.val[1]),
    vbfmlaltq_f32(vbfmlalbq_f32(c.reg.val[2], a.reg.val[2], b.reg.val[2]), a.reg.val[2], b.reg.val[2]),
    vbfmlaltq_f32(vbfmlalbq_f32(c.reg.val[3], a.reg.val[3], b.reg.val[3]), a.reg.val[3], b.reg.val[3]),
  }));
}
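
For reference, the fp32-to-bf16 packing direction mentioned above can be expressed with the same conversion intrinsics. This is only a minimal sketch with a hypothetical helper name (pack_bf16 is not a function from this PR), showing how vcvtq_low_bf16_f32 and vcvtq_high_bf16_f32 compose:

#include <arm_neon.h>

// Sketch only: pack eight fp32 lanes into one bfloat16x8_t.
// Requires ARMv8.6-A bf16 support (e.g. compile with -march=armv8.6-a+bf16).
inline bfloat16x8_t pack_bf16(float32x4_t lo, float32x4_t hi) {
  // vcvtq_low_bf16_f32 narrows 'lo' into lanes 0-3 of a bf16 vector;
  // vcvtq_high_bf16_f32 then narrows 'hi' into lanes 4-7, keeping lanes 0-3.
  return vcvtq_high_bf16_f32(vcvtq_low_bf16_f32(lo), hi);
}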

Fixes

@nivibilla

Is there scope to make use of the Metal backend to improve performance?

@pathorn
Author

pathorn commented Dec 24, 2023

My original goal for this project was to spend one day getting something I could use for development on macOS. Using CPU vector instructions was sufficient to achieve that goal, so I won't promise to put effort into developing a Metal backend.

One good thing this change does is add a --device command-line option: in theory, it might be possible to pass --device mps. PyTorch already supports the mps backend, but you would have to figure out a plan for porting the CUDA code in csrc/ to Metal.

The good news is that there is now a CPU implementation in csrc/ in addition to the CUDA one, with a wrapper function named _dispatch. The dispatch function and the two existing implementations might make it easier to add an additional mps/Metal implementation, but I am sorry to say that I do not have time to help with this.
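
To make the idea concrete, here is a purely hypothetical sketch of what such a per-device dispatch wrapper can look like; the function names and the choice of op are illustrative assumptions, not the actual _dispatch code in csrc/:

#include <torch/extension.h>
#include <stdexcept>

// Assumed declarations for illustration; the real kernels would live under csrc/.
void silu_and_mul_cpu(torch::Tensor& out, torch::Tensor& input);
void silu_and_mul_cuda(torch::Tensor& out, torch::Tensor& input);

// Route one op to the implementation matching the tensor's device.
void silu_and_mul_dispatch(torch::Tensor& out, torch::Tensor& input) {
  switch (input.device().type()) {
    case torch::kCPU:  return silu_and_mul_cpu(out, input);
    case torch::kCUDA: return silu_and_mul_cuda(out, input);
    // A Metal backend would add a torch::kMPS case calling an MPS kernel here.
    default: throw std::runtime_error("unsupported device for silu_and_mul");
  }
}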

Edit: I want to be clear and temper expectations. Once again, my goal was merely to be able to run vLLM locally and speed up development. This code does not implement GPTQ support, so for CPU execution you would be much better served by llama.cpp, which achieves about 20 TPS on Mistral-7B at 4-bit on a MacBook, versus roughly 0.7 TPS in bfloat16 with this version.

@nivibilla

Np, thanks for explaining

@sandangel

sandangel commented Jan 18, 2024

Hi @pathorn, have you looked at https://github.com/ml-explore/mlx ? I think it would give better performance on a Mac without much effort, and it would utilize the Unified Memory on Apple Silicon devices. Do you think it would not be a complex task to add a wrapper for mlx and use it as a vllm backend for Mac devices? I'm looking into that, but my limited understanding of ML development makes it look like not an easy task for me. They have many examples here: https://github.com/ml-explore/mlx-examples including running inference on a Mac. The best I could probably do is use the interface from the wrapper and connect it with the OpenAI API endpoints.

@rahuja23

rahuja23 commented Feb 7, 2024

Is this issue resolved and merged? I am trying to install the vllm package on my Mac M1 via pip and I get the following error:

  error: subprocess-exited-with-error
  
  × Getting requirements to build editable did not run successfully.
  │ exit code: 1
  ╰─> [31 lines of output]
      /private/var/folders/0j/z708xtw54fnc_l45_x3l1kvr0000gn/T/pip-build-env-85fb99p7/overlay/lib/python3.11/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
        device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
      Traceback (most recent call last):
        File "/opt/homebrew/Caskroom/miniforge/base/envs/GPT-FROM-SCRATCH/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/opt/homebrew/Caskroom/miniforge/base/envs/GPT-FROM-SCRATCH/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/homebrew/Caskroom/miniforge/base/envs/GPT-FROM-SCRATCH/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 132, in get_requires_for_build_editable
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/0j/z708xtw54fnc_l45_x3l1kvr0000gn/T/pip-build-env-85fb99p7/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 441, in get_requires_for_build_editable
          return self.get_requires_for_build_wheel(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/0j/z708xtw54fnc_l45_x3l1kvr0000gn/T/pip-build-env-85fb99p7/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/0j/z708xtw54fnc_l45_x3l1kvr0000gn/T/pip-build-env-85fb99p7/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/private/var/folders/0j/z708xtw54fnc_l45_x3l1kvr0000gn/T/pip-build-env-85fb99p7/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 354, in <module>
        File "/private/var/folders/0j/z708xtw54fnc_l45_x3l1kvr0000gn/T/pip-build-env-85fb99p7/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1076, in CUDAExtension
          library_dirs += library_paths(cuda=True)
                          ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/0j/z708xtw54fnc_l45_x3l1kvr0000gn/T/pip-build-env-85fb99p7/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1203, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
                                 ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/0j/z708xtw54fnc_l45_x3l1kvr0000gn/T/pip-build-env-85fb99p7/overlay/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2416, in _join_cuda_home
          raise OSError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

@bluenevus

+1

@pathorn
Author

pathorn commented Mar 7, 2024

Rebased onto latest vllm. Tested with OPT-175B

Still missing some ops, such as gelu_and_mul, which is used in Gemma.
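
For anyone curious what that op computes, here is a hedged scalar sketch of the gelu_and_mul semantics (an input row of width 2*d, GELU applied to the first half and multiplied by the second half). The helper name is made up for this illustration, and it is not the vectorized kernel the backend needs:

#include <cmath>

// Sketch only: out[i] = GELU(x[i]) * x[d + i] for an input row of width 2*d.
// Uses the erf-based GELU; the exact GELU variant a given model uses may differ.
void gelu_and_mul_ref(float* out, const float* x, int d) {
  for (int i = 0; i < d; ++i) {
    float v = x[i];
    float gelu = 0.5f * v * (1.0f + std::erf(v * 0.70710678f));  // 1/sqrt(2)
    out[i] = gelu * x[d + i];
  }
}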

@pathorn
Author

pathorn commented Mar 7, 2024

Rebased and updated with gelu ops for Gemma.

google/gemma-2b runs pretty well on my MacBook.

Updated the build instructions. It does not currently install via pip because the VLLM_TARGET_DEVICE=cpu environment variable needs to be set.

@BodhiHu

BodhiHu commented Mar 23, 2024

mlc-llm/TVM is perhaps better suited for this. Tested on an M1 with really good performance.
